|
Poster:
|
pegz |
Date:
|
November 11, 2012 03:33:12am |
|
Forum:
|
texts
|
Subject:
|
Re: Omni Magazine - any proof reading? |
Firstly, thanks for the apology. I'm sorry if I went a bit overboard; it was probably mainly due to embarrassment! Like many others, I've been spreading the word on several forums about Omni being available again but, alas, without trying to read one first. Maybe I should do some proof reading too :~) I guess my love of I.A. and Omni got the better of me.
Secondly, when I say 'proof reading', I don't expect perfection. I just would have thought that someone might have glanced at the first page of the first issue to be converted (the one I pasted above) and thought "Hang on, something's not quite right here..." before ploughing on through the rest.
|
Poster:
|
pegz |
Date:
|
November 11, 2012 03:58:05am |
|
Forum:
|
texts
|
Subject:
|
Re: Omni Magazine - any proof reading? |
...anyway, if the proof reading was perfect, I'd miss out on glorious lines such as 'Life is adrift in a sea of Radox'!
I know from the context it should be 'radiation', but it's so much more relaxing to think of it as drifting in pine-scented bath salts...
|
Poster:
|
aibek |
Date:
|
November 11, 2012 06:09:41am |
|
Forum:
|
texts
|
Subject:
|
Re: Omni Magazine - any proof reading? |
The OCR software could delete the obvious gibberish in the text file. It is not difficult for the software to identify it -- just notice that there are very few dictionary words amongst all the characters on the page. This will take care of the cases where there are whole files with no useful words! (e.g. with Sanskrit texts.)
But it is computationally expensive! Something not even worth noticing for one file -- like deleting obvious gibberish -- becomes something to consider when you have hundreds or thousands of files, even if the algorithm is straightforward. In the IA case, with all sorts of “derivations” of texts, audio and video running on probably thousands of files simultaneously (all of which are computationally intensive), the developers are probably not free to do all that they may wish to!
But if the engineers have not considered this, someone should draw their attention to it! Note that the unedited text file is not a “faithful representation” which should not be touched; the PDF file, or more exactly the TIFF images, are the “faithful representation”.
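To make the dictionary-word idea above concrete, here is a minimal sketch in Python. It is only an illustration of the check I described, not anything the IA or the OCR software actually does. The names `SAMPLE_WORDS`, `dictionary_ratio` and `looks_like_gibberish`, and the 0.3 threshold, are all my own assumptions; a real run would load a full word list and tune the threshold.

```python
import re

# Hypothetical, tiny word list for illustration only.
# A real check would load a full dictionary file instead.
SAMPLE_WORDS = {
    "life", "is", "adrift", "in", "a", "sea", "of", "radiation", "the", "and",
}


def dictionary_ratio(text, words):
    """Return the fraction of alphabetic tokens in `text` found in `words`.

    Pages with no Latin-alphabet tokens at all score 0.0.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in words)
    return hits / len(tokens)


def looks_like_gibberish(text, words, threshold=0.3):
    """Flag a page whose share of dictionary words is below `threshold`.

    The default threshold is an arbitrary assumption, not a tested value.
    """
    return dictionary_ratio(text, words) < threshold


if __name__ == "__main__":
    good_page = "Life is adrift in a sea of radiation"
    bad_page = "xq zzv ;;; lllt rrw"
    print(looks_like_gibberish(good_page, SAMPLE_WORDS))
    print(looks_like_gibberish(bad_page, SAMPLE_WORDS))
```

Note that this only flags pages that are nearly all noise. It would not catch a single wrong but valid-looking word such as 'Radox', since the rest of that line is made of ordinary words.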
This post was modified by aibek on 2012-11-11 14:09:41
|
Poster:
|
Jeff Kaplan |
Date:
|
November 11, 2012 08:55:44am |
|
Forum:
|
texts
|
Subject:
|
Re: Omni Magazine - any proof reading? |
We're well aware that, to be generous, OCR is less than perfect. And you're correct that at scale we would not be able to manually review or correct the ABBYY OCR. At this point the main suggestions are to crowdsource (which we do not have the manpower to manage) or have interested folks upload corrected files to new items and use good metadata so that the corrected file items appear high in search results.